ABSTRACT


A special structure in dynamic programming which has been studied by
Bellman, Blackwell, D'Epenoux, Derman, Howard, Manne, Oliver, Wolfe and
Dantzig, and others is the problem of programming over a Markov chain.
This paper extends their results and solution algorithms to programming
over a Markov-renewal process, in which times between transitions of the
system from state i to state j are independent samples from an
inter-transition distribution which may depend on both i and j. For these
processes, a general reward structure and a decision mechanism are
postulated; the problem is to make decisions at each transition to maximize
the total expected reward at the end of the planning horizon.

For finite-horizon problems, or infinite-horizon problems with discounting,
there is no difficulty; the results are similar to previous work, except for
a new dependency upon the transition-time distributions being generally
present. In the cases where the horizon extends towards infinity, or when
discounting vanishes, however, a fundamental dichotomy in the optimal
solutions may occur. It then becomes important to specify whether the
limiting experiment is: (i) undiscounted, with the number of transitions
$n \to \infty$, (ii) undiscounted, with a time horizon $t \to \infty$, or
(iii) infinite $n$ or $t$, with discount factor $\alpha \to 0$. In each
case, a limiting form for the total expected reward is shown, and an
algorithm developed to maximize the rate of return. The problem of finding
the optimal or near-optimal policies in the case of ties in rate of return
is still computationally unresolved.

Extensions to non-ergodic processes are indicated, and special results for
the two-state process are presented. Finally, an example of machine
maintenance and repair is used to illustrate the generality of the approach
and the special problems which may arise.
I.  Introduction

An important special structure of dynamic programming occurs in the Markov
decision processes first formulated by Bellman, developed extensively by
Howard, and further analysed by Oliver, Manne, D'Epenoux, Blackwell, Wolfe
and Dantzig, Derman, and others. In this model, the system makes Markovian
transitions from one to another of a finite set of states, accumulating a
reward at each transition.

A decision is made at each step from among a finite number of alternatives;
this decision affects both the transition probabilities and the rewards
obtained upon leaving the present state. The problem is to specify the
policy of decisions to be made in each state which will maximize the total
expected return at the end of the experiment.

The following cases are formulated by Howard:

    I.  Discrete-parameter Markov chain
        A.  Finite number of transitions
        B.  Infinite number of transitions
        C.  Repeat of both cases with discounting

    II.  Continuous-parameter Markov chain
        A.  Finite planning horizon
        B.  Infinite planning horizon
        C.  Repeat of both cases with discounting

The models with finite horizons are all expressed in terms of the usual
recursive relationships of dynamic programming, whose solution techniques
are well-known. Howard's contribution was the development of simple, finite,
iterative techniques to find the optimal stationary policies to be followed
in the infinite cases; since the total reward is unbounded in the
undiscounted, infinite-horizon cases, the rate of return becomes the system
objective. Blackwell shows a similar algorithm for maximizing the return for
a vanishing discount factor, and proves that among the optimal policies,
there is one which is stationary.

The  purpose  of  this  paper  is  to  generalize  all  of  the  above  models 
and  algorithms  of  Markov  decision  processes  to  a  larger  class  of  dynamic 
models  in  which  the  Markov-renewal  process  is  used  to  describe  system 
behavior.  The  important  generalization  provided  by  these  processes  is  that 
time  spent  by  the  system  between  transitions  may  be  a  random  variable.  The 
resulting Markov-renewal decision processes will be seen to embrace a
wider range of important operational problems, without seriously
complicating the calculation of optimal policies.

The  first  three  sections  of  the  paper  describe  the  properties  of  the 
Markov-renewal  process,  the  reward  structure  assumed,  and  the  decision 
process.  The  first  cases  analyzed  are  the  finite-step  and  finite-time 
problems,  both  discounted  and  undiscounted.  Next,  discounted  problems 
with an infinite horizon are examined, followed by a discussion of the
difficulties of undiscounted, infinite-horizon models. Three distinct
"infinite" cases are presented, with some remarks on the problems of ties
and near-optimal policies. Extensions to nonergodic structures are described
in the next section, followed by a comparison with some previous results.
The paper closes with explicit formulae for the two-state process, a machine
maintenance-repair example, and suggestions for further research.
II.  Markov-Renewal  Processes


The Markov-renewal processes and the related semi-Markov processes were
first studied by Lévy, Smith, and Takács independently in 1954. In
References [15] and [16] Pyke has summarized current results in this area,
together with independent contributions and an exhaustive list of
references. Other specific results are in References [17], [18], and [1].

Loosely speaking, Markov-renewal processes are generalizations of both the
discrete- and continuous-parameter Markov chains in which the time between
transitions of the system from state i to state j is a random variable
obtained from a distribution which depends upon both i and j.

We shall only be concerned with Markov-renewal processes with a finite
number of states, labelled by some integer $i$, $i = 1, 2, \ldots, N$.

A particular realization of a Markov-renewal process consists of an initial
integer $i_0$, followed by pairs of random variables, one of which is an
integer, and the other a non-negative variable $\tau$, viz:

$$ i_0;\ (i_1, \tau_1),\ (i_2, \tau_2),\ (i_3, \tau_3),\ \ldots $$

The integer $i_0$ represents the initial state of the system at time zero;
it may be given uniquely, or determined from some initial distribution. The
sequence of integers $i_1, i_2, i_3, \ldots$ represents the successive
states of the system as it makes transitions between its allowed states at
steps 1, 2, 3, ... These integers are generated by a Markov process, so that
the conditional probability distribution

$$ p_{ij} = \Pr\{\, i_{k+1} = j \mid i_k = i \,\}, \qquad k = 0, 1, 2, \ldots; \quad i, j = 1, 2, \ldots, N \tag{1} $$

contains all of the information necessary to generate the successive states
of the system, once $i_0$ is known.

The sequence of non-negative variables $\tau_1, \tau_2, \tau_3, \ldots$
represents the transition intervals between successive states. Thus
$\tau_{k+1}$ is the time between the instant the system entered state $i_k$
at step $k$ and the instant it entered state $i_{k+1}$ at the next
transition. It is not necessary to describe the state of the system between
these transition instants, in general; however, for convenience, one may
speak of the system as being in state i, headed toward state j. Notice that
it is necessary to select the next state immediately upon entering a given
state so that the transition interval can be determined; in Markov-renewal
processes this transition interval is determined from the stationary
distribution functions:

$$ F_{ij}(t) = \Pr\{\, T(i,j) \le t \,\}, \qquad t \ge 0; \quad i, j = 1, 2, \ldots, N. \tag{2} $$

The moments of this distribution will be denoted by
$\nu_{ij}^{(n)} = E\{[T(i,j)]^n\}$, $n = 0, 1, 2, \ldots$; the superscript
(1) is suppressed for the mean transition interval. It is assumed that
$F_{ij}(0) = 0$ for all $i, j$, so that $0 < \nu_{ij}^{(n)} \le \infty$ for
all $n$.

For convenience, several results of Pyke which will be used in the sequel
are presented in Appendix A; full details and additional explicit formulae
may be found in References [15], [16], [17], [18], and [1].
III.  The  Reward  Structure


We next describe the reward structure which will be assumed for the system.
When the system makes a transition into state i, heading towards j, we
assume that a fixed return of $R_{ij}$ dollars is received. Also, a variable
running return of $r_{ij}$ dollars per unit time is assumed to be generated
during the transition time, so that for a time $t$ since the last
transition, a cumulative reward $R_{ij} + r_{ij} t$ $(0 \le t \le T(i,j))$
is generated. The expected return from state i, heading towards j, is:

$$ \rho_{ij} = \int_0^\infty (R_{ij} + r_{ij} t)\, dF_{ij}(t) = R_{ij} + r_{ij}\,\nu_{ij}. \tag{3} $$

If a discount factor $\alpha$ is used,

$$ \rho_{ij}(\alpha) = \int_0^\infty \left[ R_{ij} + r_{ij} \int_0^t e^{-\alpha x}\, dx \right] dF_{ij}(t) = R_{ij} + \frac{r_{ij}}{\alpha} \left[ 1 - \int_0^\infty e^{-\alpha t}\, dF_{ij}(t) \right] \tag{4} $$

for all $i, j$. More general reward structures may also be used.
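As a small numerical illustration of (3) and (4), the following sketch evaluates the expected return for a single i-to-j transition. It assumes, purely for illustration, that the transition time is exponentially distributed with a hypothetical rate parameter, so that the Laplace-Stieltjes transform of the distribution is rate/(rate + alpha); the function names are not from the paper.

```python
def expected_return(R, r, mean_T):
    """Equation (3): undiscounted expected return R + r * nu."""
    return R + r * mean_T


def discounted_return(R, r, alpha, lst):
    """Equation (4): R + (r / alpha) * (1 - transform of F at alpha).

    `lst` is a callable giving the Laplace-Stieltjes transform of the
    transition-time distribution (an assumption of this sketch).
    """
    return R + (r / alpha) * (1.0 - lst(alpha))


def exponential_lst(rate):
    """Transform of an exponential distribution with the given rate."""
    return lambda alpha: rate / (rate + alpha)


if __name__ == "__main__":
    lst = exponential_lst(rate=2.0)          # mean transition time 0.5
    print(expected_return(R=5.0, r=3.0, mean_T=0.5))
    print(discounted_return(R=5.0, r=3.0, alpha=0.1, lst=lst))
```

For the exponential case the discounted running return reduces to r/(rate + alpha), which tends to the undiscounted value r times the mean transition time as alpha tends to zero.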


IV.  The  Decision  Process 

It remains to describe the procedure by which the system behavior will be
governed. Let us assume that there are a finite number of alternatives,
$z = 1, 2, \ldots, Z$, available in each state of the system; selection of a
certain alternative then influences the transition times and transition
probabilities to the next state, as well as the rewards to be obtained
during the interval until that transition. To put it another way, there are
families of $Q_{ij}^z(t) = p_{ij}^z F_{ij}^z(t)$ functions, as well as
decision-dependent rewards, $R_{ij}^z$, and reward rates, $r_{ij}^z$, for
each $z = 1, 2, \ldots, Z$. The operating policy of the system is a
selection of a $z$ to be used in each state of the system, possibly
depending also on the remaining length of the experiment.

To summarize the system behavior under the influence of a given policy:

1.  System enters state i.

2.  Alternative, $z(i)$, is selected from among the available
    alternatives: it is a function only of the current system state i,
    and (possibly) the remaining length of the experiment.

3.  Based upon $z(i)$, a next state j is selected as a sample from the
    conditional probabilities, $p_{ij}^{z(i)}$; the sojourn time until
    that next state is entered, $T(i,j;z(i))$, is selected as a sample
    from the distribution $F_{ij}^{z(i)}(t)$.

4.  For a clock time $t$ since state i was entered, a cumulative reward

        $R_{ij}^{z(i)} + r_{ij}^{z(i)}\, t, \qquad 0 \le t \le T(i,j;z(i))$

    is generated.

5.  The system enters state j, and the process is repeated until the
    experiment is terminated.

The fundamental problem which we shall consider in this paper is the
selection of the alternatives for each state, $z(i)$, which will maximize
total expected return over the length of the experiment.

As we shall see, determination of this optimal policy will depend
critically upon whether discounting is or is not used, or in the way in
which certain limiting experiments are defined. In some cases, general
results as to the optimal policy will not be obtainable.
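The five steps of the decision process can be sketched as a short simulation. This is only an illustrative sketch under stated assumptions: states and alternatives are indexed from 0, the policy is stationary, rewards are undiscounted, and the names `simulate` and `sample_T` are hypothetical, not taken from the paper.

```python
import random


def simulate(p, sample_T, R, r, policy, start, n_steps, rng):
    """Run n_steps transitions; return (final_state, elapsed_time, reward).

    p[z][i][j]      transition probabilities under alternative z
    sample_T        callable (i, j, z, rng) -> one transition interval
    R[z][i][j]      fixed return; r[z][i][j] running return per unit time
    policy[i]       alternative used whenever state i is entered
    """
    state, clock, total = start, 0.0, 0.0
    for _ in range(n_steps):
        z = policy[state]                                   # step 2
        weights = p[z][state]
        nxt = rng.choices(range(len(weights)), weights=weights)[0]  # step 3
        T = sample_T(state, nxt, z, rng)                    # sojourn time
        total += R[z][state][nxt] + r[z][state][nxt] * T    # step 4 at t = T
        clock += T
        state = nxt                                         # step 5
    return state, clock, total


if __name__ == "__main__":
    rng = random.Random(0)
    p = [[[0.3, 0.7], [0.6, 0.4]]]
    R = [[[1.0, 2.0], [0.5, 0.0]]]
    r = [[[0.1, 0.2], [0.3, 0.4]]]
    expo = lambda i, j, z, g: g.expovariate(1.0)  # assumed exponential times
    print(simulate(p, expo, R, r, policy=[0, 0], start=0, n_steps=10, rng=rng))
```

Averaging the returned reward over many runs gives a Monte Carlo estimate of the expected return of the chosen policy for an n-step experiment.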
V.  Finite  Step,  Discounted  Case

The first experiment is the operation of the system for a fixed number of
transition steps, $n$. Following the usual procedure of dynamic programming,
define for all $i$, $n = 0, 1, 2, \ldots$

    $V_i(n)$ = expected return obtained from an n-step process
               starting in state i, and using an optimal policy,
               $z^*(i,n)$.

Continuous discounting with parameter $\alpha$ per unit time will be used,
so that $V_i(n)$ also depends upon $\alpha$.

Setting for convenience, $V_i(0) = 0$ $(i = 1, 2, \ldots, N)$, the optimal
expected returns for a one-step process can be obtained for all i through
the relation:

$$ V_i(1) = \max_z \rho_i^z(\alpha) \tag{5} $$

where

$$ \rho_i^z(\alpha) = \sum_{j=1}^N p_{ij}^z\, \rho_{ij}^z(\alpha) = \sum_{j=1}^N p_{ij}^z \left[ R_{ij}^z + \frac{r_{ij}^z}{\alpha} \bigl( 1 - \tilde f_{ij}^z(\alpha) \bigr) \right] \tag{6} $$

is the expected discounted, one-step return, starting in state i and
following policy z. In (6) a tilde is used to indicate the Laplace-Stieltjes
transform of $F_{ij}(t)$, or the Laplace transform of its derivative,
$f_{ij}(t)$, if it exists:

$$ \tilde f_{ij}(\alpha) = \int_0^\infty e^{-\alpha t} f_{ij}(t)\, dt. \tag{7} $$

Similar notation for other transforms will be used in the sequel. Using the
principle of optimality, the recurrence relations for $n = 2, 3, \ldots$
and all $i$ are:

$$ V_i(n) = \max_z \left\{ \rho_i^z(\alpha) + \sum_{j=1}^N p_{ij}^z\, \tilde f_{ij}^z(\alpha)\, V_j(n-1) \right\}. \tag{8} $$


The appearance of the factor $\tilde f_{ij}(\alpha)$ is due to the necessity
for discounting the return with $n - 1$ steps left by an amount which
depends upon $T(i,j)$. The expected discounted return is then

$$ \int_0^\infty e^{-\alpha t}\, V_j(n-1)\, dF_{ij}(t) = \tilde f_{ij}(\alpha)\, V_j(n-1). $$

Equation (8) suggests a simple technique for computing the optimal policies,
$z^*(i,n)$. One begins with $n = 1$, building up the optimal policies and
returns for successively larger problems. As Bellman and Dreyfus have
pointed out [4], the computation is not complicated so much by the
requirements for storing the sequences $\{V_i(n)\}$ and $\{z^*(i,n)\}$ as it
is by the necessity of storing the $4Z$ matrices $(R_{ij}^z)$,
$(r_{ij}^z)$, $(p_{ij}^z)$, $(\tilde f_{ij}^z(\alpha))$, each of which is of
dimension $N \times N$. The last matrix must of course be recalculated for
each change in discount factor.

In the special case where transition intervals are all of fixed length $T$,
then $\tilde f_{ij}(\alpha) = e^{-\alpha T} = \beta$, and (8) may be written
as

$$ V_i(n) = \max_z \sum_{j=1}^N p_{ij}^z \left[ R_{ij}^z + r_{ij}^z\, \frac{1 - \beta}{\alpha} + \beta\, V_j(n-1) \right] \tag{9} $$

for $i = 1, 2, \ldots, N$; $n = 2, 3, \ldots$. Upon redefinition of the
expected reward per transition, (9) is seen to be equivalent to the
discounted, finite-step Markov decision processes studied by Howard and
Blackwell.
In the more general case of the Markov-renewal process, we obtain a
complicated dependence on the discount factor because of the influence of
the entire shape of the transition time distributions, expressed through the
$\tilde f_{ij}(\alpha)$. If the boundary rewards or penalties, $V_i(0)$, are
not zero, (5) should be replaced by:

$$ V_i(1) = \max_z \left\{ \rho_i^z(\alpha) + \sum_{j=1}^N p_{ij}^z\, \tilde f_{ij}^z(\alpha)\, V_j(0) \right\} \tag{5'} $$

or, equivalently, the range of (8) and (9) can be extended to
$n = 1, 2, 3, \ldots$.

Because of the discounting, the sequence of expected returns $\{V_i(n)\}$
approaches a finite limit as $n$ approaches infinity, for all $\alpha > 0$.
It is not apparent what happens to the sequence of optimal decisions
$\{z^*(i,n)\}$; we shall return to this point in a later section.
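The recursion (8), with boundary values as in (5'), can be sketched in a few lines. This is a minimal illustration, assuming that the one-step returns from (6) and the transforms from (7) have already been evaluated at the chosen discount factor and are passed in as nested lists; the function name and the 0-based indexing are assumptions of the sketch.

```python
def finite_step_values(p, rho, f, n_steps, terminal=None):
    """Build V_i(n) and z*(i, n) for n = 1..n_steps by recursion (8).

    p[z][i][j]   transition probabilities under alternative z
    rho[z][i]    one-step discounted return, equation (6)
    f[z][i][j]   transform of the transition-time distribution, equation (7)
    terminal[i]  boundary rewards V_i(0); zero if omitted
    Returns (V, policies), both indexed by n; policies[0] is None.
    """
    Z, N = len(p), len(p[0])
    V = [list(terminal) if terminal is not None else [0.0] * N]
    policies = [None]
    for _ in range(n_steps):
        prev, row, choice = V[-1], [], []
        for i in range(N):
            # Test quantity of (8) for every alternative z in state i.
            tests = [rho[z][i]
                     + sum(p[z][i][j] * f[z][i][j] * prev[j] for j in range(N))
                     for z in range(Z)]
            best = max(range(Z), key=tests.__getitem__)
            row.append(tests[best])
            choice.append(best)
        V.append(row)
        policies.append(choice)
    return V, policies


if __name__ == "__main__":
    # One state, two alternatives: a quick small reward versus a larger
    # reward whose long transition time discounts the future heavily.
    p = [[[1.0]], [[1.0]]]
    rho = [[1.0], [2.0]]
    f = [[[0.9]], [[0.1]]]
    V, policies = finite_step_values(p, rho, f, n_steps=3)
    print(V, policies)
```

In this small example the preferred alternative with one step left differs from the one preferred with more steps left, which illustrates why the optimal policy is written as a function of both i and n.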


VI.  Finite  Time,  Discounted  Case 

Because the transitions in a Markov-renewal process occur stochastically
over time, another possible experiment suggests itself: operation of the
system for a fixed period of time, $t$. For example, in certain operational
problems it may be more realistic to think of a fixed horizon in time,
rather than a horizon of fixed number of steps. The optimal policy,
$z^*(i,t)$, now depends upon the length of time the experiment has yet to
run. Alternatives are still only selected at the transition instants;
however, it is now possible to break off the experiment in the middle of
some transition interval. Define for $t \ge 0$, and all $i$:

    $V_i(t)$ = expected return obtained from a process which
               continues for t units of time, starting in state i,
               and using an optimal policy, $z^*(i,t)$.

Let $V_{ij}(0)$ = return (or penalty) imposed when system operation is
terminated in state i, headed towards j.

Suppressing momentarily the policy to be followed, there are two
possibilities: either the system has not transferred out of the starting
state during the observation interval $(0, t]$, or it has made a transition
to state j at time $x$, $0 < x \le t$. In the first case, the total
discounted reward if the system was headed towards state j is:

$$ R_{ij} + r_{ij}\,[1 - e^{-\alpha t}]/\alpha + V_{ij}(0)\, e^{-\alpha t}. $$

In the second case the reward is:

$$ R_{ij} + r_{ij}\,[1 - e^{-\alpha x}]/\alpha + V_j(t - x)\, e^{-\alpha x}. $$

The average discounted return for $t > 0$, and all $i$, is:

$$ V_i(t) = \sum_{j=1}^N p_{ij} \left\{ V_{ij}(0)\, e^{-\alpha t} F_{ij}^c(t) + R_{ij} + r_{ij} \int_0^t e^{-\alpha x} F_{ij}^c(x)\, dx + \int_0^t e^{-\alpha x}\, V_j(t - x)\, dF_{ij}(x) \right\} \tag{10} $$

where $F_{ij}^c(t) = 1 - F_{ij}(t)$.

Because of the impossibility of zero intervals between transitions, the
right hand side of (10) contains only the past history of $V_j(x)$
$(0 \le x < t)$. The principle of optimality can thus be used to write the
recurrence relation for expected total return when following an optimal
policy as:

$$ V_i(t) = \max_z \left\{ \sigma_i^z(\alpha, t) + \sum_{j=1}^N p_{ij}^z \int_0^t e^{-\alpha x}\, V_j(t - x)\, dF_{ij}^z(x) \right\} \tag{11} $$

for $t > 0$ and all $i$, with:

$$ \sigma_i^z(\alpha, t) = \sum_{j=1}^N p_{ij}^z \left[ V_{ij}(0)\, e^{-\alpha t} F_{ij}^{c,z}(t) + R_{ij}^z + r_{ij}^z \int_0^t e^{-\alpha x} F_{ij}^{c,z}(x)\, dx \right]. \tag{12} $$

With these definitions:

$$ V_i(0) = \max_z \sum_{j=1}^N p_{ij}^z \bigl( V_{ij}(0) + R_{ij}^z \bigr), \tag{13} $$


i.e., it is possible to have a reward with a zero-length experiment. If
desired, this anomaly can be removed by making the terminal reward a
function of state i only, i.e., $V_{ij}(0) = V_i(0)$ for all $i, j$, and by
collecting the return $R_{ij}$ at some time slightly after $t = 0$. Or,
$R_{ij}$ can be collected at the end of the transition period, in which case
$R_{ij}$ should be replaced by $R_{ij} \int_0^t e^{-\alpha x}\, dF_{ij}(x)$
in (4), (6) and (12) (with the upper limit taken as $\infty$ in (4) and
(6)).

There is, unfortunately, no general way in which (11) can be completely
resolved. There are various approximation techniques in return- or
policy-space which may converge on an answer; however, if in fact digital
computation is to be used, then one may proceed directly to discrete
approximations of the continuous time variable in (11). Letting
$t = k\Delta$ $(k = 1, 2, 3, \ldots)$, one obtains for all $i$ the
approximations:

$$ V_i(k\Delta) = \max_z \left\{ \sigma_i^z(\alpha, k\Delta) + \sum_{j=1}^N p_{ij}^z \sum_{\ell=1}^{k} e^{-\alpha \ell \Delta}\, V_j\bigl((k - \ell)\Delta\bigr) \left[ F_{ij}^z(\ell\Delta) - F_{ij}^z(\ell\Delta - \Delta) \right] \right\} \tag{14} $$

which can be built up in the usual recursive manner. Round-off and
truncation error limit the choice of small values of $\Delta$, but the use
of more sophisticated quadrature techniques will usually lead to results of
sufficient accuracy [4].
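The quantized recursion can be sketched as follows. This is a rough illustration only: it assumes 0-based state and alternative indices, a user-supplied distribution function `F(z, i, j, t)` with value zero at t = 0, and a simple left Riemann sum for the integral in (12), which is a choice of this sketch rather than anything prescribed in the text. The function name is hypothetical.

```python
import math


def finite_time_values(p, F, R, r, terminal, alpha, delta, K):
    """Return V with V[k][i] approximating V_i(k * delta), k = 0..K.

    p[z][i][j], R[z][i][j], r[z][i][j] as in the text;
    F(z, i, j, t) is the transition-time distribution function;
    terminal[i][j] is the terminal reward V_ij(0).
    """
    Z, N = len(p), len(p[0])
    # Zero-length experiment, equation (13).
    V = [[max(sum(p[z][i][j] * (terminal[i][j] + R[z][i][j])
                  for j in range(N)) for z in range(Z)) for i in range(N)]]
    for k in range(1, K + 1):
        t = k * delta
        row = []
        for i in range(N):
            best = -math.inf
            for z in range(Z):
                total = 0.0
                for j in range(N):
                    if p[z][i][j] == 0.0:
                        continue
                    Fc = lambda x: 1.0 - F(z, i, j, x)
                    # Left Riemann sum for the running-return integral.
                    integral = sum(math.exp(-alpha * m * delta)
                                   * Fc(m * delta) * delta for m in range(k))
                    sigma = (terminal[i][j] * math.exp(-alpha * t) * Fc(t)
                             + R[z][i][j] + r[z][i][j] * integral)
                    # Sum over possible transition instants l * delta.
                    conv = sum(math.exp(-alpha * l * delta) * V[k - l][j]
                               * (F(z, i, j, l * delta)
                                  - F(z, i, j, (l - 1) * delta))
                               for l in range(1, k + 1))
                    total += p[z][i][j] * (sigma + conv)
                best = max(best, total)
            row.append(best)
        V.append(row)
    return V


if __name__ == "__main__":
    expo = lambda z, i, j, x: 1.0 - math.exp(-x)   # assumed exponential times
    V = finite_time_values([[[1.0]]], expo, [[[0.0]]], [[[1.0]]],
                           [[0.0]], alpha=0.1, delta=0.1, K=20)
    print(V[-1])
```

The work grows with the square of the number of time steps, so in practice the step size trades accuracy against computation, as the text notes.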

It should be obvious that the optimal strategies deduced for a time-horizon
problem need not bear any resemblance to the optimal strategies for a
problem with a fixed number of transitions, except possibly in the limit, if
a stationary optimal policy exists. It is this point which we examine in the
next section.

VII.  Discounted  Cases  with  Infinite  Step  or  Time  Horizons 

It is a simple matter to verify that expected returns $V_i(n)$ or $V_i(t)$
remain finite as $n$ or $t$ approach infinity in (8) and (11) for all
$\alpha > 0$, since all elements of the matrix
$q = (p_{ij} \tilde f_{ij}(\alpha))$ lie in the interval $[0, 1)$ for all
$\alpha > 0$, and for all policies.

The limiting form of (8) as $n \to \infty$ is:

$$ V_i = \lim_{n \to \infty} V_i(n) = \max_z \left\{ \rho_i^z(\alpha) + \sum_{j=1}^N p_{ij}^z\, \tilde f_{ij}^z(\alpha)\, V_j \right\} \tag{15} $$

for all $i$.
From the Laplace transform of (11) and the well-known limit theorem of
transform calculus, $\lim_{t \to \infty} V(t) = \lim_{s \to 0^+} [s \tilde V(s)]$,
the limiting form for an infinite time horizon is:

$$ V_i = \lim_{t \to \infty} V_i(t) = \max_z \left\{ \rho_i^z(\alpha) + \sum_{j=1}^N p_{ij}^z\, \tilde f_{ij}^z(\alpha)\, V_j \right\} \tag{15'} $$

since $\lim_{t \to \infty} \sigma_i(\alpha, t) = \rho_i(\alpha)$.
Consider for a moment only the stationary policies; that is, ones which do
not depend upon the number of steps (or time) since the beginning of the
experiment, nor until the end of the problem. It follows that the return
from every stationary policy in (15) or (15') must satisfy the simultaneous
equations $(i = 1, 2, \ldots, N)$

$$ V_i = \rho_i(\alpha) + \sum_{j=1}^N p_{ij}\, \tilde f_{ij}(\alpha)\, V_j \tag{16} $$

where the dependence upon the policy has been suppressed for clarity. In
other words, by following the optimal stationary policies in either the
n-step or t-horizon formulation, one obtains the same stationary policies,
and the same limiting values for the expected total discounted return
starting in state i.

We now present an algorithm, related to the policy-space iterative technique
of Howard [6], [9], for finding the optimal stationary policy for an
infinite, discounted Markov-renewal program. The flow chart for the
algorithm is shown in Figure 1. Basically, the algorithm uses (16) to solve
for a set of expected returns following some policy; then, those returns are
used to select a better alternative in each state. When two successive
policies are found to be identical, the algorithm terminates with an optimal
stationary policy, and the maximum expected, discounted returns.

    Guess an initial policy    OR    Guess an initial set of returns

    Using the $p_{ij}$, $\tilde f_{ij}(\alpha)$, and $\rho_i(\alpha)$ for the
    current policy, solve the set of equations

        $V_i = \rho_i(\alpha) + \sum_{j=1}^N p_{ij}\, \tilde f_{ij}(\alpha)\, V_j, \qquad i = 1, 2, \ldots, N$

    to determine the present expected returns, $V_i$.

    For each state i, find the alternative $z(i)$ which maximizes

        $\rho_i^z(\alpha) + \sum_{j=1}^N p_{ij}^z\, \tilde f_{ij}^z(\alpha)\, V_j$

    using the present returns, $V_j$. Make $z(i)$ the new alternative in
    the i-th state. (If there is no improvement in the test quantity from
    the last cycle, retain the same alternative.) Repeat for all states
    $i = 1, 2, \ldots, N$.

    If the new policy is identical with the one from the last cycle: DONE.
    Otherwise continue.

    Figure 1. Flow chart of algorithm for optimal policy for
    infinite-horizon, discounted Markov-renewal program.

To prove that this algorithm converges, one must show that:

(1)  It is always possible to solve the set of simultaneous equations.

(2)  The policy-determining step strictly increases the expected return
     of at least one state in each cycle of the algorithm, if there was
     an improvement in the test quantity which led to the change in
     policy.

(3)  If two successive policies are identical, then the algorithm has
     converged on the optimal policy, in the sense that no other policy
     can lead to higher expected returns for any state i.

(4)  The algorithm terminates in a finite number of cycles.

The proof of (1) follows from the fact that all elements of the matrix
$q(\alpha)$ lie in the interval $[0, 1)$ for all $\alpha > 0$, while the
proof of (2) requires the additional observation that the diagonal elements
of the matrix $[I - q(\alpha)]^{-1}$ are at least as great as one. The
complete proof of this algorithm closely parallels that of Howard for the
Markov decision process, and the reader is referred to Chapter 7 of
Reference [9] for further details.

The fact that there are only a finite number of policies guarantees
convergence in a finite number of cycles.

There are also available some special-purpose linear programming algorithms
to find the optimal stationary policy in a Markov decision process [20],
[14], [1], [8] which could equally well be applied to this problem.

It is important to notice that direct enumeration of all stationary policies
is usually not possible, since there are $Z^N$ of them.
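The cycle of Figure 1 can be sketched in plain Python. This is a minimal sketch, not a definitive implementation: it assumes 0-based indices, precomputed one-step returns and transforms at the chosen discount factor, a small hypothetical tolerance for deciding whether the test quantity has improved, and a simple Gaussian elimination (the helper `solve_linear`) in place of whatever linear solver one would use in practice.

```python
def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(A[i]) + [b[i]] for i in range(n)]
    for c in range(n):
        piv = max(range(c, n), key=lambda row: abs(M[row][c]))
        M[c], M[piv] = M[piv], M[c]
        for row in range(c + 1, n):
            factor = M[row][c] / M[c][c]
            for k in range(c, n + 1):
                M[row][k] -= factor * M[c][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][k] * x[k] for k in range(i + 1, n))) / M[i][i]
    return x


def policy_iteration(p, rho, f, policy=None, tol=1e-12):
    """Return (policy, V) for the infinite-horizon discounted problem.

    p[z][i][j], rho[z][i], f[z][i][j] as in equations (6), (7) and (16).
    """
    Z, N = len(p), len(p[0])
    policy = list(policy) if policy is not None else [0] * N
    while True:
        # Solve V = rho + q V for the current policy, equation (16).
        A = [[(1.0 if i == j else 0.0)
              - p[policy[i]][i][j] * f[policy[i]][i][j]
              for j in range(N)] for i in range(N)]
        V = solve_linear(A, [rho[policy[i]][i] for i in range(N)])
        new_policy = []
        for i in range(N):
            def test(z):
                return rho[z][i] + sum(p[z][i][j] * f[z][i][j] * V[j]
                                       for j in range(N))
            best = max(range(Z), key=test)
            # Retain the old alternative unless the improvement is strict.
            if test(best) <= test(policy[i]) + tol:
                best = policy[i]
            new_policy.append(best)
        if new_policy == policy:
            return policy, V
        policy = new_policy


if __name__ == "__main__":
    p = [[[1.0]], [[1.0]]]
    rho = [[1.0], [2.0]]
    f = [[[0.9]], [[0.1]]]
    print(policy_iteration(p, rho, f, policy=[1]))
```

Each cycle costs one linear solve plus a sweep over all states and alternatives, which is far less work than examining every stationary policy.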
VIII.  The  Optimal  Policy  with  Discounting

It is possible to produce a slightly stronger result than that of the last
section; namely: among all the optimal policies for the infinite-step or
infinite-time discounted case, there exists an optimal, stationary policy.
This is important operationally since it may be difficult to follow a
non-stationary policy. Our proof closely parallels that of Blackwell for the
discounted Markov decision process, and we shall only sketch in the major
steps.

For the purposes of this section only, number the policy to be followed in
the i-th state in terms of the number of steps from the beginning of the
experiment; thus $z_n(i)$ is the policy to be followed if state i is entered
at the n-th step from the beginning of the process. Letting $z_n$ be the
N-dimensional vector of policies at the n-th step, call
$y = \{z_1, z_2, z_3, \ldots\}$ the sequence of policies to be followed,
starting at step one. Finally, a sequence of policies $y^*$ is said to be
optimal if $V_i(y^*) \ge V_i(y)$ for all $i = 1, 2, \ldots, N$, and for all
possible sequences of policies $y$.

The essential step in Blackwell's proof lies in the observation that the
transformation which maps the returns at the (n - 1)st step into the returns
at the n-th step is monotone. In other words, if $V_j' \ge V_j$ for all $j$,
then:

$$ \sum_{j=1}^N q_{ij}^z\, V_j' \;\ge\; \sum_{j=1}^N q_{ij}^z\, V_j $$

for all $i$. But this fact follows from the observation that all of the
$q_{ij}^z$ are non-negative.
The proof then proceeds in the following steps:

1.  If the expected returns following the policy sequence
    $\{z_n\} = \{z_1, z_2, z_3, \ldots\}$ are all greater than or equal to
    the corresponding returns following the policy sequence
    $\{f, z_1, z_2, \ldots\}$, for all possible policy vectors $f$, then
    the sequence of policies $\{z_n\}$ is optimal.

2.  If the expected returns from the policy sequence
    $\{f, z_1, z_2, \ldots\}$ are all not less, but at least one is
    greater than, the corresponding returns from the policy sequence
    $\{z_1, z_2, \ldots\}$; then, the stationary policy sequence
    $\{f, f, f, \ldots\}$ stands in the same relationship to the policy
    sequence $\{z_1, z_2, z_3, \ldots\}$.

3.  Taking any stationary policy, either it is optimal, or there is an
    improvement possible with another stationary policy. Since there are
    only finitely many stationary policies, there must be one over which
    no improvement can be made; hence this one, which is found by the
    algorithm of Figure 1, must be optimal.

For details, the reader is referred to Section 3 of Reference [6].

A similar result is expected to obtain in the case of the infinite-time,
discounted process because of the monotonicity of the transformations in
(11), or the quantized equivalent (14), and because it has already been seen
that the stationary optimal policies for the infinite-time case give the
same expected return as for the infinite-step case.

IX.  The  Problem  of  No  Discounting 

When the discount factor $\alpha$ approaches zero in either the finite-time
or finite-step process, it is seen from (8), and (11) or (14) that no
particular difficulties are encountered.

which can be resolved through a quantized approximation such as (21).

As  the  planning  horizons  approach  infinity,  the  expected  total  returns 
without  discounting  also  become  infinite,  and  it  is  not  clear  what  objective 
should  be  set  for  system  optimization.  There  are  three  distinct  "infinite" 
objectives  which  might  be  posed: 

(i)  Attempt to find policies in (17) which are optimal for all $n$
     sufficiently large.

(ii)  Attempt to find policies in (19) which are optimal for all $t$
      sufficiently large.

or

(iii)  Attempt to find policies in (15) which are optimal for all $\alpha$
       sufficiently close to zero.

Unfortunately, there is no a priori reason to assume that the policies which
might be found from these three approaches would in any way resemble each
other. From Blackwell's investigation of case (iii) for the Markov decision
process, it is known that there may be both optimal and near-optimal
policies as the discounting vanishes. He also shows that it may be very
difficult, computationally speaking, to find all of the optimal or
near-optimal policies. Finally, when discounting vanishes, the structure of
the underlying Markov chain becomes more important in determining the nature
of the limiting results, and this must be taken into consideration.

In  order  to  partially  circumvent  some  of  these  difficulties,  the  in¬ 
vestigations  of  the  limiting  cases  in  the  following  sections  will  have  the 
following  additional  restrictions: 

[i] Only stationary policies will be investigated.

[ii] The Markov-renewal process will be assumed to have a single, finite, underlying Markov chain which is ergodic (irreducible and positive recurrent) for every policy.

[iii] All of the $\nu_{ij}$ are assumed finite.

These assumptions are not too unreasonable for real problem solutions, since a stationary policy is usually desirable for long-term planning, primarily because of the stability it introduces, but also because of the ease of modification if the input data changes. The elimination of transient and absorbing states, or of multiple-chain structures, also presents no problems, since special extensions can be developed for these cases. If any of the $\nu_{ij}$ are infinite, the process will tend to get stuck, on the average, in that state; thus, this state behaves as if it were absorbing, and should be handled separately.

The  assumption  of  a  finite  number  of  states,  and  a  finite  number  of 
alternatives,  is  very  important  and  cannot  be  easily  eliminated. 

X.  Infinite Step, Undiscounted Case

The approach to be used is to show that (17) gives a limiting form $V_i(n) \approx Gn + W_i$ for all i, and for a certain stationary policy z(i). An algorithm will then be produced which finds the optimal stationary policy, in the sense that it produces a scalar G which is at least as large as that obtained for any other policy.

Assume that we are following some stationary policy, and let V(n) denote the column vector of expected returns at the nth step, $\rho$ the column vector of one-step returns, and $\mathcal{P}$ the matrix $(p_{ij})$ of transition probabilities. Equation (17) can be written as:

$$V(n) = \rho + \mathcal{P}\,V(n-1) = [I + \mathcal{P} + \mathcal{P}^2 + \cdots + \mathcal{P}^{n-1}]\,\rho + \mathcal{P}^n V(0), \qquad n = 1, 2, \ldots \tag{21}$$

where I is the unit matrix. Then,

$$V(n) - V(n-1) = \mathcal{P}^{n-1}\rho + [\mathcal{P}^n - \mathcal{P}^{n-1}]\,V(0).$$


But, if the Markov chain is ergodic, $\mathcal{P}^n$ converges or is Cesàro-summable to a probability matrix $\Pi$, each row of which is the same (row) vector $\pi = \{\pi_1, \pi_2, \ldots, \pi_N\}$, whose elements are all positive. In addition, $\pi$ is the unique probability vector which satisfies $\pi\mathcal{P} = \pi$; that is, it is the stationary vector for the ergodic chain. Thus,

$$\lim_{n\to\infty} [V(n) - V(n-1)] = \Pi\rho$$

or,

$$\lim_{n\to\infty} [V_i(n) - V_i(n-1)] = \sum_{j=1}^{N} \pi_j\rho_j = G \tag{22}$$

for all i. The scalar G is called the system gain by Howard; notice that for a single ergodic chain, it is independent of the state i. (The scalar G should not be confused with the functions $G_{ij}(t)$.)
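The Cesàro qualification in (22) matters for periodic chains. The following minimal sketch iterates the recursion in (21) for a two-state chain that alternates deterministically between its states. The one-step returns 420 and -260 are taken from policy (B,A) of the example in Section XVIII, under the assumption that $\rho_i = R_{ij} + r_{ij}\nu_{ij}$ there; the variable names are illustrative only.

```python
# Two-state chain that alternates deterministically (p12 = p21 = 1).
P = [[0, 1],
     [1, 0]]
rho = [420, -260]   # one-step returns, assumed from policy (B,A) of Section XVIII

V = [0, 0]          # terminal values V(0), taken as zero
diffs = []          # successive differences V(n) - V(n-1)
for n in range(1, 41):
    V_new = [rho[i] + sum(P[i][j] * V[j] for j in range(2)) for i in range(2)]
    diffs.append([V_new[i] - V[i] for i in range(2)])
    V = V_new

# The differences oscillate and never converge, but their running
# (Cesaro) average settles on the gain G = (420 - 260) / 2.
cesaro = [sum(d[i] for d in diffs) / len(diffs) for i in range(2)]
print("first differences:", diffs[:2])
print("Cesaro average   :", cesaro)
```

For an aperiodic chain the differences themselves converge, and no averaging is needed.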

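Let $W_i(n) = V_i(n) - Gn$; then:

$$W(n) = [I + \mathcal{P} + \cdots + \mathcal{P}^{n-1} - n\Pi]\,\rho + \mathcal{P}^n V(0) = \Big[ I + \sum_{j=1}^{n-1} (\mathcal{P}^j - \Pi) - \Pi \Big]\rho + \mathcal{P}^n V(0), \qquad n = 1, 2, \ldots \tag{23}$$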


But, if the underlying chain is ergodic, $\lim_{n\to\infty} \big[ I + \sum_{j=1}^{n-1} (\mathcal{P}^j - \Pi) \big]$ converges or is Cesàro-summable to what Kemeny and Snell call the fundamental matrix $Z = (z_{ij})$. By simple manipulations, it is seen that Z is the well-defined inverse $(I - (\mathcal{P} - \Pi))^{-1}$, and satisfies the relations $\mathcal{P}Z = Z\mathcal{P}$, $\pi Z = \pi$, and $I - Z = \Pi - \mathcal{P}Z$. Thus,

$$W = \lim_{n\to\infty} W(n) = [Z - \Pi]\,\rho + \Pi V(0) \tag{24}$$


or,

$$V_i(n) \approx Gn - G + \sum_{j=1}^{N} z_{ij}\rho_j + \sum_{j=1}^{N} \pi_j V_j(0) \tag{25}$$

for all i, where

$$G = \sum_{i=1}^{N} \pi_i\rho_i\,; \qquad Z = (I - \mathcal{P} + \Pi)^{-1}. \tag{26}$$


There is another interpretation of (25) which is of interest. Let $M_{ij,n}$ be the mean number of times the system enters state j in n transitions, if the system was started in state i. It is clear that for all i, j, and n = 1, 2, ...

$$M_{ij,n} = \sum_{k=0}^{n-1} p_{ij}^{(k)} = \delta_{ij} + \sum_{k=1}^{N} p_{ik}\,M_{kj,n-1}, \tag{27}$$

where $p_{ij}^{(k)}$ is the (i, j) element of $\mathcal{P}^k$, so that, if $M_n$ is the matrix $(M_{ij,n})$,

$$V(n) = M_n\rho + \mathcal{P}^n V(0)\,; \qquad n = 1, 2, \ldots \tag{28}$$

that is, the expected value of a certain state above and beyond its expected terminal value is equal to the mean number of times that other states of the chain are visited in n steps times the expected one-step rewards in those states, summed over all states which are visited.

Furthermore, it is known that for large n and all i, j

$$M_n \approx n\Pi + [Z - \Pi] \tag{29}$$

or

$$M_{ij,n} \approx n\pi_j + [z_{ij} - \pi_j],$$

which is an alternate way of deducing (25). Thus the fact that the gain is the same for all initial states is a consequence of the limiting properties of the $M_{ij,n}$, which depend upon convergence to a stationary set of probabilities. The constant term represents the bias due to the initial starting state.
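The roles of the gain and of the bias can be checked numerically. The sketch below builds the fundamental matrix as the truncated series $I + \sum_j (\mathcal{P}^j - \Pi)$ for a small aperiodic chain, forms the bias $(Z - \Pi)\rho$ of (24) with zero terminal values, and compares it with the value obtained by iterating the recursion directly. The chain, the returns, and the helper name `matmul` are assumptions chosen for illustration; they do not come from the paper.

```python
def matmul(A, B):
    """Product of two square matrices given as nested lists."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


# Illustrative aperiodic two-state chain; every number here is an assumption.
P = [[0.6, 0.4], [0.2, 0.8]]
rho = [5.0, -1.0]
N = 2

p12, p21 = P[0][1], P[1][0]
pi = [p21 / (p12 + p21), p12 / (p12 + p21)]   # stationary vector
Pi = [pi[:] for _ in range(N)]                # every row of Pi equals pi
G = sum(pi[j] * rho[j] for j in range(N))     # gain, equation (22)

# Z = I + sum_{j>=1} (P^j - Pi), truncated once the terms are negligible.
Z = [[1.0 if i == j else 0.0 for j in range(N)] for i in range(N)]
Pk = [row[:] for row in Z]
for _ in range(200):
    Pk = matmul(Pk, P)
    for i in range(N):
        for j in range(N):
            Z[i][j] += Pk[i][j] - Pi[i][j]

# Bias W = (Z - Pi) rho, i.e. equation (24) with V(0) = 0.
W = [sum((Z[i][j] - Pi[i][j]) * rho[j] for j in range(N)) for i in range(N)]

# Direct check: iterate V(n) = rho + P V(n-1) and subtract G n.
n_steps = 100
V = [0.0] * N
for _ in range(n_steps):
    V = [rho[i] + sum(P[i][j] * V[j] for j in range(N)) for i in range(N)]

print("G =", G)
print("W from Z     :", W)
print("V(n) - G n   :", [V[i] - G * n_steps for i in range(N)])
```

The series form is used here only because it mirrors the text; inverting $I - \mathcal{P} + \Pi$ as in (26) gives the same matrix.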

It  now  is  possible  to  produce  an  algorithm  to  find  the  optimal 
stationary  policy;  this  algorithm  will  parallel  Figure  1,  and  the  corresponding 
algorithm  for  Markov  decision  processes  in  Chapter  4,  Reference  [  9]. 

There are some computational simplifications which can be made. At each step in the algorithm, we shall be testing to see which alternative will give the greatest possible increase in return; substituting the asymptotic expression (25) into (21), we obtain after clearing terms,

$$G + \sum_{j=1}^{N} z_{ij}\rho_j = \rho_i + \sum_{j=1}^{N} p_{ij} \sum_{k=1}^{N} z_{jk}\rho_k \tag{30}$$

for all i. There are N simultaneous equations in the N + 1 unknowns: G, and the $\sum_{j=1}^{N} z_{ij}\rho_j$. Thus, as Howard has pointed out, the unknowns cannot be resolved uniquely; one can only find the relative values of the $\sum_{j=1}^{N} z_{ij}\rho_j$. These numbers are called the relative values of the policy, and are denoted by $V_i$; it is usual practice to set one of them, say $V_N$, equal to zero, and then solve the equations (30), which are now well-defined. For i = 1, 2, ..., N-1:

$$V_i + G = \rho_i + \sum_{j=1}^{N-1} p_{ij} V_j\,; \qquad V_N = 0\,, \quad G = \rho_N + \sum_{j=1}^{N-1} p_{Nj} V_j. \tag{31}$$
Figure 2. Flow chart of algorithm for optimal stationary policy for infinite-step, undiscounted Markov-renewal program.

The advantage of the equations (31) is that they are much easier to compute at each step of the algorithm than, say, finding first $\Pi$, then Z, and then the test quantities.

The right-hand side of (31) is used as the test quantity for policy improvement in the algorithm shown in Figure 2; the procedure parallels that of Chapter 4, Reference [9], except for the immediate expected reward, $\rho_i$, given by (18), which is different, due to the nature of the Markov-renewal process. The same remarks which were made in conjunction with Figure 1 still obtain: the policy-determining step strictly increases the gain G; the algorithm has converged on the optimal policy when two succeeding policies are identical; and so on. The proof is elementary, and may be found in the above reference.
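The cycle of value determination by (31) followed by policy improvement can be sketched as follows. This is a minimal illustration, not a transcription of Figure 2: the data layout, the function names (`solve`, `evaluate`, `improve`), and the rule of keeping the current alternative on ties are assumptions. The numbers are the one-step returns of the machine example in Section XVIII, computed under the assumption that $\rho_i = R_{ij} + r_{ij}\nu_{ij}$. Exact fractions are used so that ties are detected reliably.

```python
from fractions import Fraction as F

# alternatives[i][k] = (row of transition probabilities, one-step return rho)
alternatives = [
    {"A": ([F(0), F(1)], F(400)), "B": ([F(0), F(1)], F(420))},    # state 1
    {"A": ([F(1), F(0)], F(-260)), "B": ([F(1), F(0)], F(-300))},  # state 2
]


def solve(A, b):
    """Gauss-Jordan elimination with exact fractions; returns x with A x = b."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        piv = next(r for r in range(c, n) if M[r][c] != 0)
        M[c], M[piv] = M[piv], M[c]
        M[c] = [x / M[c][c] for x in M[c]]
        for r in range(n):
            if r != c and M[r][c] != 0:
                M[r] = [x - M[r][c] * y for x, y in zip(M[r], M[c])]
    return [M[i][n] for i in range(n)]


def evaluate(policy):
    """Solve (31): unknowns are V_1..V_{N-1} and G, with V_N fixed at zero."""
    n = len(policy)
    rows = [alternatives[i][k] for i, k in enumerate(policy)]
    A = [[(F(1) if i == j else F(0)) - rows[i][0][j] for j in range(n - 1)] + [F(1)]
         for i in range(n)]
    b = [rows[i][1] for i in range(n)]
    x = solve(A, b)
    return x[-1], x[:-1] + [F(0)]


def improve(policy, V):
    """Pick, in each state, the alternative maximizing rho + sum_j p_ij V_j."""
    new = []
    for i, k_old in enumerate(policy):
        def test(k):
            p, rho = alternatives[i][k]
            return rho + sum(pj * vj for pj, vj in zip(p, V))
        best = max(alternatives[i], key=test)
        new.append(best if test(best) > test(k_old) else k_old)
    return tuple(new)


policy = ("A", "B")              # arbitrary starting policy
while True:
    G, V = evaluate(policy)
    nxt = improve(policy, V)
    if nxt == policy:
        break
    policy = nxt

print(policy, G, V)
```

Starting from (A,B), the sketch moves to (B,A) after one improvement step and then stops, with gain 80 and relative values 340 and 0.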

When the algorithm is terminated, an optimal stationary policy $z^*(i)$ will have been found, together with the maximal gain $G^*$ and relative values $V_i^*$. At this point, one may calculate the stationary probabilities and the fundamental matrix, and thence find the optimal total return as:

$$V_i^*(n) \approx G^* n - G^* + V_i^* + \sum_{j=1}^{N} \pi_j^* V_j(0) + \sum_{j=1}^{N} z_{Nj}^*\,\rho_j^* \tag{32}$$

for all i.

XI.  Infinite  Time,  Undiscounted  Case 

Because of the close relationship between the infinite-step and infinite-time solutions in the discounted case, it might be expected that this section would be a repeat of the last one; however, one of our main results is that this is not the case. First, it is shown that (19) gives a limiting form $V_i(t) \approx gt + w_i$ as t approaches infinity, for all i, and for a certain stationary policy, z(i). We shall then produce an algorithm to find the stationary policy which maximizes the rate of increase of the expected value, g.

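Assume some stationary policy and take the Laplace transform of (19). For all i, and for s > 0:

$$V_i(s) = \sigma_i(s) + \sum_{j=1}^{N} q_{ij}(s)\,V_j(s), \tag{33}$$

with

$$\sigma_i(s) = \sum_{j=1}^{N} p_{ij} \left\{ V_{ij}(0)\,\frac{1 - f_{ij}(s)}{s} + \frac{R_{ij}}{s} + r_{ij}\,\frac{1 - f_{ij}(s)}{s^2} \right\}. \tag{34}$$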


Denoting the various column vectors by dropping the subscript, and the matrix of $(q_{ij}(s))$ by q(s), Equation (33) becomes

$$V(s) = [I - q(s)]^{-1}\,\sigma(s). \tag{35}$$

The matrix I - q(s) has an inverse for s > 0, but as s approaches zero both the inverse and $\sigma(s)$ become ill-defined. This difficulty can be resolved through the use of first-passage time distributions, $G_{ij}(t)$, and the mean entry-counting functions, $M_{ij}(t)$, which are discussed in Appendix A.


(The scalar g should not be confused with the functions $g_{ij}(t)$.)

Combining (35) and the transform of (A.6),

$$V(s) = \sigma(s) + m(s)\,\sigma(s) \tag{36}$$

or, for $t \ge 0$, i = 1, 2, ..., N:

$$V_i(t) = \sigma_i(t) + \sum_{j=1}^{N} \int_0^t \sigma_j(t - x)\,dM_{ij}(x). \tag{37}$$


The relationship between the expected return and the mean entry-counting functions in (37) should be compared with the corresponding relationship (28) for the infinite-step process.

The $M_{ij}(t)$ are, of course, not the same as their discrete counterparts $M_{ij,n}$, but are related to the first-passage time distributions $G_{ij}(t)$ through the transform of (A.5), for s > 0, and all i, j:

$$m_{ij}(s) = g_{ij}(s) + m_{jj}(s)\,g_{ij}(s). \tag{38}$$

Because of the assumption of finite $\nu_{ij}$, it turns out that all of the mean first-passage times, $\mu_{ij}$, are also finite for an ergodic chain. At this point, it is convenient also to assume that the diagonal second moments, $\mu_{jj}^{(2)}$, are finite. Finally, it is important to distinguish between two cases: either $G_{jj}(t)$ is a lattice distribution, or it is not. In the first case, our results (39) and (40) hold only when averaged over the lattice period in question. It can be seen that a sufficient condition for $G_{jj}(t)$ to be a non-lattice distribution is that at least one nonzero $Q_{ij}(t)$ also be a non-lattice distribution, for i = 1, 2, ..., N.

With these conditions, one has the interesting result for large t, and all i, j:[1][18]*

$$M_{ij}(t) \approx \frac{t}{\mu_{jj}} + \left[ \frac{\mu_{jj}^{(2)}}{2\mu_{jj}^2} - \frac{\mu_{ij}}{\mu_{jj}} \right]. \tag{39}$$

This relationship may be found directly from (38), using well-known limit theorems of transform calculus; or, it follows directly from the observation that $M_{ij}(t)$ is the mean renewal (counting) function for the renewal process with inter-event distribution $G_{jj}(t)$, and the use of a theorem by Smith.[19] The limiting form of the mean entry-counting function is thus related only to the first and second moments of the first-passage time distributions.

By using the key renewal theorem[19] or through direct transform arguments, we then argue that (36) has the limit for large t, and all i:

$$V_i(t) \approx \left[ \sum_{j=1}^{N} \frac{\rho_j}{\mu_{jj}} \right] t + \rho_i + \sum_{j=1}^{N} \frac{1}{\mu_{jj}} \left\{ \rho_j \left[ \frac{\mu_{jj}^{(2)}}{2\mu_{jj}} - \mu_{ij} \right] - \xi_j \right\}, \tag{40}$$

where $\xi_j$ is the area under $\rho_j - \sigma_j(t)$. From (20), this is:

$$\xi_j = \sum_{k=1}^{N} p_{jk} \left\{ \tfrac{1}{2}\,r_{jk}\,\nu_{jk}^{(2)} - V_{jk}(0)\,\nu_{jk} \right\}. \tag{41}$$

If the fixed rewards $R_{ij}$ are paid at the end of the interval before transition, instead of at the beginning, then a term $+R_{jk}\nu_{jk}$ must be added under the summation sign in (41).

* The slight discrepancy between (39) and Equation (A.9) of Reference [1] is due to their convention of smoothing out a lattice distribution function into the following period, while we use a symmetric smoothing.
Thus, the limiting form of the expected total return is of the form $V_i(t) \approx gt + w_i$, with the gain rate $g = \sum_{j=1}^{N} \rho_j/\mu_{jj}$ being the same for all initial states, a consequence of the assumption of ergodicity of the underlying Markov chain.

Appendix A presents formulae suitable for finding the $\mu_{ij}$ and the $\mu_{jj}^{(2)}$; calculation of these moments is not essential, however, to find the optimal stationary policy. Substituting the limiting form in (19), and writing $\nu_i = \sum_{j} p_{ij}\nu_{ij}$ for the mean time until the next transition out of state i,

$$w_i = \rho_i - g\nu_i + \sum_{j=1}^{N} p_{ij} w_j + \left\{ \sigma_i(t) - \rho_i + \sum_{j=1}^{N} p_{ij} \left[ g \int_t^\infty [1 - F_{ij}(x)]\,dx - w_j\,[1 - F_{ij}(t)] \right] \right\} \tag{42}$$

for all i. For large t, the terms in braces vanish because of the assumption about finite $\nu_{ij}$, and one obtains the following equations in the N + 1 unknowns, g and the $w_i$:

$$w_i + g\nu_i = \rho_i + \sum_{j=1}^{N} p_{ij} w_j \tag{43}$$

for i = 1, 2, ..., N. Comparison with the corresponding infinite-step relations (31) reveals that they are identical except for the coefficient in front of the gain rate, g. Thus, unless all of the $\nu_i$ are equal, say $\nu_1 = \nu_2 = \cdots = \nu_N = \nu$, the solutions to the Equations (43) will not, in general, be equal to the solutions of (31). We shall return to a discussion of this point in a later section.
Furthermore, from our previous comments about the form of the Equations (43), we know that they cannot be used to solve for the complete $w_i$ given by (40), but will only find the relative values of the $w_i$, after setting the Nth variable to zero. The following modified equations are used to solve for the relative values of the policy, $v_i$. For i = 1, 2, ..., N-1:

$$v_i + g\nu_i = \rho_i + \sum_{j=1}^{N-1} p_{ij} v_j \tag{44}$$

and

$$v_N = 0\,, \qquad g\nu_N = \rho_N + \sum_{j=1}^{N-1} p_{Nj} v_j.$$

Our algorithm for finding the optimal stationary policy is shown in Figure 3. In basic structure, it is identical to the policy approximation algorithms of the previous sections: relative values and a gain rate obtained in a previous cycle are used in a test quantity to find a new policy with increased gain rate; the new policy is used to solve (44) for the new values and gain rate; and so on. The algorithm terminates when no change in policy can be made.

The one new feature is the form of the test quantity; Equations (44) must be divided through by the $\nu_i$ (which are nonzero and finite) to form a test quantity which has the dimensions of reward rate; this seems logical in view of the fact that the algorithm increases g at each step. An alternative test quantity has been proposed by P. Schweitzer (Appendix B).
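Written out, dividing (44) through by the mean sojourn time suggests a test quantity of the following form for each alternative k available in state i. The superscript k is notation introduced here for illustration, and the flow chart of Figure 3 is not reproduced in this section, so this should be read as a sketch consistent with (44) rather than as the exact expression used there.

```latex
\max_{k}\; \frac{1}{\nu_i^{k}}
  \Big[\, \rho_i^{k} + \sum_{j=1}^{N} p_{ij}^{k}\, v_j - v_i \Big],
\qquad i = 1, 2, \ldots, N, \quad v_N = 0 .
```

For the alternative currently in use, the bracket equals $g\nu_i$ by (44), so the quantity reduces to g itself; a different alternative is attractive only when its value exceeds the current gain rate.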

Figure 3. Flow chart of algorithm for optimal stationary policy for the infinite-time, undiscounted Markov-renewal program, and for the infinite-time or -step, vanishing discount Markov-renewal program.

The proof is still elementary, and parallels those of the previous algorithms; basically one shows that $g^{z_2} > g^{z_1}$, if at some cycle the test quantity indicated a change to policy $z_2$ using the relative values and gain rate of policy $z_1$. When the algorithm is terminated, an optimal stationary policy $z^*(i)$ will have been found, together with the maximal gain rate $g^*$, and relative values $v_i^*$. At this point, one may calculate the stationary probabilities, $\pi_i$, in the usual manner, and the $\mu_{ij}$ and $\mu_{jj}^{(2)}$ from Equations (A.8), (A.9) or (A.11). The optimal expected return is:

$$V_i^*(t) \approx g^* t + v_i^* + \rho_N + \sum_{j=1}^{N} \frac{1}{\mu_{jj}} \left\{ \rho_j \left[ \frac{\mu_{jj}^{(2)}}{2\mu_{jj}} - \mu_{Nj} \right] - \xi_j \right\} \tag{45}$$

for all i, with all quantities evaluated for the optimal policy and $\xi_j$ the area under $\rho_j - \sigma_j(t)$ given by (41).


XII.  The  Vanishing  Discount  Case 

The final limiting case to be investigated is that of the infinite-step or infinite-time process whose discount factor, $\alpha$, vanishes. Thus we seek the limiting form for all i, with $\alpha$ approaching zero, of:

$$V_i(\alpha) = \rho_i(\alpha) + \sum_{j=1}^{N} p_{ij}\,f_{ij}(\alpha)\,V_j(\alpha) \tag{16}$$

with the discounted one-step returns $\rho_i(\alpha)$ given by (6).

A review of the steps encountered in finding the asymptotic form of the transform of (40) in the last section will indicate the necessary parallel between (33) and (16). It is easy to show that as $\alpha$ approaches zero, for all i:
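$$V_i(\alpha) = \frac{g}{\alpha} + \rho_i + \sum_{j=1}^{N} \frac{1}{\mu_{jj}} \left\{ \rho_j \left[ \frac{\mu_{jj}^{(2)}}{2\mu_{jj}} - \mu_{ij} \right] - \xi_j \right\} + O(\alpha) \tag{46}$$

with

$$g = \sum_{j=1}^{N} \frac{\rho_j}{\mu_{jj}}\,, \qquad \xi_j = \sum_{k=1}^{N} p_{jk} \left\{ \tfrac{1}{2}\,r_{jk}\,\nu_{jk}^{(2)} \right\}. \tag{47}$$

If the fixed rewards $R_{ij}$ are paid at the end of the transition interval, then a term $+R_{jk}\nu_{jk}$ must be added under the summation sign defining the $\xi_j$.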

Thus, rather surprisingly, the criterion for optimization turns out again to be the gain rate, g. The algorithm to be used is a repeat of that shown in Figure 3, with even the same relative values $v_i$ being obtained. The one slight difference in this case is that the terminal rewards no longer enter into the calculation of the $w_i$; however, this does not affect the optimization, but merely the final expected return.


XIII.  Ties and Near-Optimal Policies

In each of the three limiting cases just discussed, the criterion for optimization has been the dominant term in the limiting form of the expected total return, either Gn, gt, or $g/\alpha$. However, it may happen that when the algorithms of Figures 2 and 3 are carried out, there will be more than one optimal stationary policy, each with the same gain, or gain rate!

Blackwell has considered this problem in detail for vanishing discount in Markov decision processes. It is shown that when the algorithm terminates with a single policy $z^*$, and when the test quantity is strictly less than $V_i^{z^*} + G^{z^*}$ for all other alternatives, for each i, then $z^*$ is optimal, in the sense that no other policy leads to a higher value for $V_i^{z}(\alpha)$, for all $\alpha$ sufficiently close to zero. If a truly optimal policy cannot be found from this algorithm, Blackwell points out that there may be nearly-optimal policies, i.e., policies whose return converges to the return from an optimal policy, for $\alpha \to 0$. The near-optimal policies are just those for which both the gains (or gain rates) and the constant terms, $W_i$ or $w_i$, are comparable. The determination of all the near-optimal policies appears to be an arduous task, in general, in the event that there are ties, since the relative values, $V_i$ and $v_i$, are not sufficient for absolute comparisons.

Direct evaluation of all of the nearly-optimal policies may be feasible for small problems but is probably prohibitive for a "reasonable" real problem. On the other hand, one might claim that a real problem which gave two policies with the same gain rate had insufficient or inaccurate data! Nevertheless, the problem of ties is still an interesting and unresolved question, computationally.

XIV.  The  Difference  Between  Limiting  Cases 

A disturbing feature of the three cases of limiting programs:

1. $\alpha = 0$ ; $n \to \infty$

2. $\alpha = 0$ ; $t \to \infty$

3. $n, t = \infty$ ; $\alpha \to 0$

is the fact that they may give different optimal stationary policies. As we have seen, in the first case the algorithm maximizes the per-transition gain, $G = \sum_{i=1}^{N} \pi_i\rho_i$; while in the second and third cases the algorithm maximizes the gain rate, $g = \sum_{i=1}^{N} \rho_i/\mu_{ii}$.

A rationale for the equivalence of the second and third cases is that discounting may be interpreted as an experiment in which in each dt there is a probability $\alpha\,dt$ of entering an absorbing state, i.e., discontinuing the experiment. Thus, the behavior will reflect that of the time-horizon process, rather than a transition-horizon process.

The difference between the first two cases is due both to the reward structure and to the possibly different sojourn times between transitions. For instance, consider a one-state periodic process of period $\nu$, in which a reward R is given at every "transition" from the state back into itself. In order to select the maximum reward per transition, one would select the policy with largest R, no matter how large n was. But in order to maximize reward rate for a large interval of time, one would select the policy with largest $R/\nu$!
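A small numerical instance makes the contrast concrete. The two alternatives below are hypothetical and chosen only for illustration; they do not appear in the paper.

```latex
\begin{array}{lccc}
\text{alternative} & R & \nu & R/\nu \\ \hline
\text{I}  & 10 & 5 & 2 \\
\text{II} & 6  & 2 & 3
\end{array}
\qquad\Longrightarrow\qquad
\begin{cases}
\text{per transition: } 10 > 6, & \text{choose I},\\[2pt]
\text{per unit time: } 3 > 2,  & \text{choose II}.
\end{cases}
```

Over n = 100 transitions alternative I earns 1000 against 600, whereas over t = 100 time units alternative II earns about 300 against 200.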

The two criteria can be contrasted in more generality through (A.10):

$$G = \sum_{i=1}^{N} \pi_i\rho_i \qquad \text{versus} \qquad g = \frac{\sum_{i=1}^{N} \pi_i\rho_i}{\sum_{i=1}^{N} \pi_i\nu_i}. \tag{48}$$

The gain rate is influenced by the mean sojourn times in the states of the system, weighted by the stationary probabilities of making a transition to those states, and this may change with a change in policy, even when G does not.

In fact, one must make a distinction between two sets of stationary probabilities in Markov-renewal processes. The $\pi_i$ are the stationary probabilities which are the limiting values of the probability of being in state i after n transitions, as $n \to \infty$. There are also stationary probabilities, $P_i$, which are the limiting values of the probability of being in state i at time t, as $t \to \infty$. For an underlying ergodic chain, the two sets of probabilities are related by:

$$P_i = \frac{\pi_i\nu_i}{\sum_{k=1}^{N} \pi_k\nu_k} \tag{49}$$

for i = 1, 2, ..., N. Thus, within the basic theory of Markov-renewal processes, there is a fundamental distinction between behavior of state probabilities from transition to transition, and behavior over time. This distinction is also well-known in the study of queueing problems in which the method of regeneration points is used.
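The relation between the two sets of stationary probabilities, and between G and g, is easy to evaluate directly. The sketch below uses the figures of policy (B,B) from Section XVIII (mean sojourn times of 5 and 1 days, one-step returns of 420 and -300, assumed to follow from $\rho_i = R_{ij} + r_{ij}\nu_{ij}$); the function name `time_stationary` is illustrative.

```python
from fractions import Fraction as F


def time_stationary(pi, nu):
    """Time-stationary probabilities P_i = pi_i nu_i / sum_k pi_k nu_k, as in (49)."""
    total = sum(p * v for p, v in zip(pi, nu))
    return [p * v / total for p, v in zip(pi, nu)]


pi = [F(1, 2), F(1, 2)]       # transition-stationary probabilities (alternating chain)
nu = [F(5), F(1)]             # mean sojourn times, in days
rho = [F(420), F(-300)]       # one-step returns, in dollars

P_time = time_stationary(pi, nu)
G = sum(p * r for p, r in zip(pi, rho))        # gain per transition
g = G / sum(p * v for p, v in zip(pi, nu))     # gain rate, second form in (48)

print("P =", P_time)   # fraction of time spent in each state
print("G =", G, " g =", g)
```

The chain visits both states equally often, yet the machine spends five sixths of the time running; the per-transition gain of 60 dollars corresponds to a rate of 20 dollars per day.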

We are thus forced to the conclusion that when considering Markov-renewal programs with an infinite horizon, one must decide whether the system will be operated for an infinite number of transitions, or for an infinite period of time!

A pertinent question is, when will these two experiments converge on the same stationary policy? From (48), a sufficient condition is that $\sum_{i=1}^{N} \pi_i\nu_i$ be independent of the possible policies; this would certainly be true if, for every policy, and for every pair of states, the $\nu_{ij}$ were all equal to $\nu$. Thus, for a Markov decision process, $G = g\nu$, independent of the policy under consideration. Of course, for some Markov-renewal programs, the policies may be identical because of the data of the problem.

XV.  Multiple  Chain  and  Transient-State  Problems 

It  is  possible  to  extend  the  analyses  of  the  previous  sections  to 
problems  where  the  underlying  Markov  chain  has  several  recurrent  classes 
or  where  some  of  the  states  are  transient  or  absorbing. 

The problem of multiple chains has been discussed extensively by Howard.[9] The primary change in the algorithms of Figures 2 and 3 is the determination of a separate gain (or gain rate) for each class of recurrent states. The policy determination routine selects alternatives in order to maximize the average gain "reached" from state i; if there are ties, a test quantity including the relative values is used. It should be emphasized that the procedure does not break ties between the optimal policies which maximize gain within each chain, but merely breaks ties within the algorithm which leads to one of these policies. For more details, the reader is referred to Reference [9]; the necessary changes for Markov-renewal programs can easily be deduced.

If there are transient states in the underlying Markov chain, then it is a simple matter to determine the expected number of steps until a given state in one of the recurrent chains is entered. The average return accumulated en route to absorption in the recurrent chain is then added to the return of the recurrent state(s) entered, in the obvious manner; further details are left to the reader. Absorbing states, for our purposes, may be treated like a one-state recurrent chain.

Once  again  it  is  emphasized  that  these  special  considerations  relative 
to  the  underlying  Markov  chain  are  necessary  only  in  limiting  programs; 
when  discounting  is  present,  or  when  the  process  has  a  finite  horizon,  there 
is  no  difficulty. 

XVI.  Limiting  Results 

The analyses presented here may be easily specialized to the results of Markov decision processes. As an example, we specialize the results of the section on discounted programs to the continuous-parameter Markov process analyzed by Howard in Chapter 8, Reference [9].

For a continuous-parameter Markov process, $p_{ii} \equiv 0$, and $F_{ij}(t) \equiv 1 - e^{-\omega_i t}$, for all i, j, and appropriate finite $\omega_i > 0$. Thus, $f_{ij}(\alpha) = \omega_i/(\alpha + \omega_i)$, and from (6):

$$\rho_i(\alpha) = \sum_{\substack{j=1 \\ j \ne i}}^{N} p_{ij} \left\{ \frac{\omega_i}{\alpha + \omega_i}\,R_{ij} + \frac{1}{\alpha + \omega_i}\,r_{ij} \right\} \tag{50}$$

for all i, when the rewards $R_{ij}$ are earned at the end of the transition interval. Making the running rewards only dependent upon i, $r_{ij} = r_i$, the expected return (16) must satisfy:

$$(\alpha + \omega_i)\,V_i = r_i + \omega_i \sum_{\substack{j=1 \\ j \ne i}}^{N} p_{ij} R_{ij} + \omega_i \sum_{\substack{j=1 \\ j \ne i}}^{N} p_{ij} V_j \tag{51}$$

for all i. Setting:

$$a_{ij} = \omega_i\,p_{ij} \quad (i \ne j)\,; \qquad a_{ii} = -\omega_i\,,$$

we finally obtain, for i = 1, 2, ..., N:

$$\alpha V_i = r_i + \sum_{\substack{j=1 \\ j \ne i}}^{N} a_{ij} R_{ij} + \sum_{j=1}^{N} a_{ij} V_j \tag{52}$$

which is essentially Equation (8.47) of Reference [9].
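The two facts used in this specialization can be spelled out. The first is the transform of the exponential distribution; the second is the rearrangement that absorbs $\omega_i V_i$ into the sum as the diagonal term $a_{ii}V_i$. This is a routine check under the stated exponential assumption, not a step quoted from the reference.

```latex
f_{ij}(\alpha) = \int_0^\infty e^{-\alpha t}\,dF_{ij}(t)
              = \int_0^\infty e^{-\alpha t}\,\omega_i e^{-\omega_i t}\,dt
              = \frac{\omega_i}{\alpha + \omega_i},
\qquad
\frac{1 - f_{ij}(\alpha)}{\alpha} = \frac{1}{\alpha + \omega_i};
\\[6pt]
\alpha V_i
  = r_i + \sum_{j \ne i} \omega_i p_{ij} R_{ij}
        + \sum_{j \ne i} \omega_i p_{ij} V_j - \omega_i V_i
  = r_i + \sum_{j \ne i} a_{ij} R_{ij} + \sum_{j=1}^{N} a_{ij} V_j .
```

Since $p_{ii} = 0$ and the $p_{ij}$ sum to one, each row of the matrix $(a_{ij})$ sums to zero, as expected of a transition-rate matrix.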
XVII.  The Two-State Process

As a specific case of interest, we present explicit results for Markov-renewal programs with only two states. First, express the transition probabilities in terms of the off-diagonal elements:

$$\mathcal{P} = \begin{bmatrix} 1 - p_{12} & p_{12} \\ p_{21} & 1 - p_{21} \end{bmatrix} \tag{53}$$

which represents an ergodic chain for $0 < p_{12} \le 1$, $0 < p_{21} \le 1$. The stationary transition probabilities for the Markov chain are:

$$\pi = (p_{12} + p_{21})^{-1}\,(p_{21},\; p_{12}) \tag{54}$$

which, when used to calculate the fundamental matrix, gives:

$$Z - \Pi = (p_{12} + p_{21})^{-2} \begin{bmatrix} p_{12} & -p_{12} \\ -p_{21} & p_{21} \end{bmatrix}. \tag{55}$$

Similarly, the stationary probabilities over time for the Markov-renewal process are:

$$P = (\nu_1 p_{21} + \nu_2 p_{12})^{-1}\,(\nu_1 p_{21},\; \nu_2 p_{12}). \tag{56}$$

The mean first-passage times are:

$$(\mu_{ij}) = \begin{bmatrix} \dfrac{\nu_1 p_{21} + \nu_2 p_{12}}{p_{21}} & \dfrac{\nu_1}{p_{12}} \\[2ex] \dfrac{\nu_2}{p_{21}} & \dfrac{\nu_1 p_{21} + \nu_2 p_{12}}{p_{12}} \end{bmatrix} \tag{57}$$
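while the diagonal terms of the second moment of first-passage times are, with $\nu_i^{(2)} = \sum_j p_{ij}\nu_{ij}^{(2)}$:

$$\mu_{11}^{(2)} = \nu_1^{(2)} + 2p_{12}\nu_{12}\mu_{21} + \frac{p_{12}}{p_{21}} \left( \nu_2^{(2)} + 2p_{22}\nu_{22}\mu_{21} \right)$$

$$\mu_{22}^{(2)} = \nu_2^{(2)} + 2p_{21}\nu_{21}\mu_{12} + \frac{p_{21}}{p_{12}} \left( \nu_1^{(2)} + 2p_{11}\nu_{11}\mu_{12} \right) \tag{58}$$

The limiting value of the per-transition gain is then

$$G = \frac{p_{21}\rho_1 + p_{12}\rho_2}{p_{21} + p_{12}}\,; \tag{59}$$

while the infinite-step relative values from (31) are just

$$V_1 = \frac{\rho_1 - \rho_2}{p_{21} + p_{12}}\,; \qquad V_2 = 0. \tag{60}$$

The limiting value of the gain rate is:

$$g = \frac{p_{21}\rho_1 + p_{12}\rho_2}{\nu_1 p_{21} + \nu_2 p_{12}} \tag{61}$$

with the infinite-time or vanishing-discount relative values from (44):

$$v_1 = \frac{\nu_2\rho_1 - \nu_1\rho_2}{\nu_1 p_{21} + \nu_2 p_{12}}\,; \qquad v_2 = 0. \tag{62}$$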


From these explicit formulae, the optimal policy can be found by direct evaluation, if the number of alternatives is not too large. The exact constant terms in each case can then be found from the relative values and (32), or (45), or (46) and (47).
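The closed forms (59) through (62) translate directly into a small helper. The sketch below evaluates them for one hypothetical set of data (all numbers are assumptions chosen for illustration) and confirms that the results satisfy the defining relations (31) and (44) for state 1; the function name `two_state` is likewise illustrative.

```python
from fractions import Fraction as F


def two_state(p12, p21, rho1, rho2, nu1, nu2):
    """Gain, gain rate and relative values of a two-state policy, (59)-(62)."""
    G = (p21 * rho1 + p12 * rho2) / (p21 + p12)               # (59)
    V1 = (rho1 - rho2) / (p21 + p12)                          # (60), with V_2 = 0
    g = (p21 * rho1 + p12 * rho2) / (nu1 * p21 + nu2 * p12)   # (61)
    v1 = (nu2 * rho1 - nu1 * rho2) / (nu1 * p21 + nu2 * p12)  # (62), with v_2 = 0
    return {"G": G, "V1": V1, "g": g, "v1": v1}


# Hypothetical data: transition probabilities, one-step returns, mean sojourn times.
p12, p21 = F(1, 2), F(1, 4)
rho1, rho2 = F(6), F(-2)
nu1, nu2 = F(2), F(3)

res = two_state(p12, p21, rho1, rho2, nu1, nu2)

# Consistency with the defining equations for state 1 (p11 = 1 - p12):
#   (31):  V1 + G      = rho1 + p11 * V1
#   (44):  v1 + g * nu1 = rho1 + p11 * v1
check_31 = res["V1"] + res["G"] == rho1 + (1 - p12) * res["V1"]
check_44 = res["v1"] + res["g"] * nu1 == rho1 + (1 - p12) * res["v1"]

print(res, check_31, check_44)
```

Evaluating the helper once per combination of alternatives is the "direct evaluation" referred to above; with k alternatives in each state there are k squared combinations to compare.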
XVIII.  An Example

As an illustration, consider the two-state problem of a machine that is running (state 1) or has broken down (state 2). If the machine is running there are two maintenance alternatives:

Alternative A:  $R_{12} = 0$ ; $r_{12} = \$100$/day ; $\nu_{12} = 4$ days ; $p_{12} = 1$.

Alternative B:  $R_{12} = 0$ ; $r_{12} = \$84$/day ; $\nu_{12} = 5$ days ; $p_{12} = 1$.

If the machine has broken down, there are two repair alternatives:

Alternative A:  $R_{21} = 0$ ; $r_{21} = -\$65$/day ; $\nu_{21} = 4$ days ; $p_{21} = 1$.

Alternative B:  $R_{21} = -\$100$ ; $r_{21} = -\$200$/day ; $\nu_{21} = 1$ day ; $p_{21} = 1$.

Alternative B may be thought of as an outside repairman whose expensive fixed and running charges are compensated for by his quick service time. The finite- and infinite-step processes without discounting are independent of the complete transition time distributions, and depend only on the means, $\nu_{ij}$. By direct evaluation:

$$G^{AA} = \$70\,; \quad G^{AB} = \$50\,; \quad G^{BA} = \$80\,; \quad G^{BB} = \$60,$$

so that (B,A) (expensive maintenance, cheap repair) is the optimal stationary policy for a large number of transitions. It is also the optimal policy for all values of n. Figure 4 shows total expected return as a function of n, starting in either state 1 or 2. The exact expressions are:
$$V_1(n) = 80n + 170 + 170\,(-1)^{n+1}$$

$$V_2(n) = 80n - 170 - 170\,(-1)^{n+1}.$$

When these fluctuations are averaged out, the last term vanishes in each equation, so that the limiting form has $G = \$80$, and $W_1 = \$170$, $W_2 = -\$170$. These values may be checked from (32); (31) will, of course, only produce $V_1 = \$340$, $V_2 = 0$.

In the finite- and infinite-time processes without discounting, one must specify the form of the transition-time distributions; let us suppose that they are all degenerate, with the given means. Figure 5 shows, in heavy lines, the return obtained when following the optimal (nonstationary) policy for all t. The optimal policy itself is indicated by means of solid bars above and below the return curves; notice that by t = 19 44/149, the optimal policy in state 2 has stabilized to policy B, but that the return curves and the policy in state 1 do not stabilize until t = 30. At this point, the optimal policy in state 1 is to use either A or B, and the return curves have the form:

V_1(t) = 20t + 301 + u_1(t)

V_2(t) = 20t - 18 [fraction illegible] + u_2(t)

for t > 30. u_1(t) and u_2(t) are sawtooth curves of period one, whose
time-average value is zero.

By  direct  evaluation, 

g^{AA} = 17.50 ;  g^{AB} = 20.0 ;  g^{BA} = 17.77 ;  g^{BB} = 20.0

in  dollars  per  day. 
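The gain rates per day follow from dividing the expected reward of one complete cycle (state 1 to state 2 and back) by the mean length of that cycle. The sketch below is only an illustration; the names are hypothetical, and the one-transition rewards and mean times are those implied by the example data, including the inferred 4-day cheap-repair time.

```python
from fractions import Fraction

# (expected reward per transition in $, mean transition time in days)
MAINTENANCE = {"A": (400, 4), "B": (420, 5)}
REPAIR = {"A": (-260, 4), "B": (-300, 1)}


def gain_rate(maint, repair):
    """Expected reward per day over one maintenance-plus-repair cycle."""
    (q1, nu1), (q2, nu2) = MAINTENANCE[maint], REPAIR[repair]
    return Fraction(q1 + q2, nu1 + nu2)


rates = {m + r: gain_rate(m, r) for m in "AB" for r in "AB"}
for policy, rate in rates.items():
    print(policy, rate, float(rate))
```

The exact value for (B, A) is 160/9 = 17.777..., which the text quotes as 17.77. Note that the ranking by reward per day differs from the ranking by reward per transition: (B, A) has the highest gain per transition but not the highest gain rate.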
so that both (A, B) and (B, B) are maximal gain-rate stationary policies,
indicating that expensive repair should always be used! For comparison,
Figure 5 also shows the total returns obtained when the stationary policies
are followed for all time. For policy (A, B) (dotted line):

V_1(t) = 20t + 150 + u_3(t)

V_2(t) = 20t - 170 + u_4(t)

and for policy (B, B) (dashed line):

V_1(t) = 20t + 151 2/3 + u_5(t)

V_2(t) = 20t - 168 1/3 + u_6(t)

for all t > 0, where u_3(t), u_4(t), u_5(t), and u_6(t) are all sawtooth curves
of period five (A, B), or six (B, B), whose time-average values are zero.

The results can be obtained graphically, or from (40) and (41). A surprising
result is that while both (A, B) and (B, B) are the limiting stationary
policies of the optimal nonstationary policy, only (B, B) is the optimal,
completely stationary policy, in the sense of maximizing the w_i, when there
is a tie in the gain rates, g. This resolution of the tie cannot be found
from (44), which gives relative values v_1 = 320, v_2 = 0, for both policies.
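The constant offsets for the two tied stationary policies can be recovered by averaging V(t) - gt over one cycle of the degenerate process. The sketch below is a hedged illustration, not the graphical construction or Equations (40) and (41) of the text: it assumes that the fixed repair charge is incurred at the start of the repair transition and that running rewards accrue linearly, and all names are hypothetical.

```python
from fractions import Fraction as F

# (fixed reward $, running rate $/day, degenerate duration in days); assumed data
MAINTENANCE = {"A": (0, 100, 4), "B": (0, 84, 5)}
REPAIR = {"A": (0, -65, 4), "B": (-100, -200, 1)}


def gain_and_offset(cycle):
    """Return (g, w): gain rate and time-average of V(t) - g*t over one cycle.

    `cycle` lists the transitions in the order they occur from the start state.
    Assumes each fixed reward is collected at the start of its transition.
    """
    period = sum(days for _, _, days in cycle)
    g = F(sum(fixed + rate * days for fixed, rate, days in cycle), period)
    level = F(0)   # current value of V(t) - g*t
    area = F(0)    # integral of V(t) - g*t accumulated so far
    for fixed, rate, days in cycle:
        level += fixed                  # jump caused by the fixed reward
        slope = rate - g                # V grows at `rate`, g*t grows at g
        area += level * days + slope * days * days / 2
        level += slope * days
    return g, area / period


def offsets(maint, repair):
    m, r = MAINTENANCE[maint], REPAIR[repair]
    g, w1 = gain_and_offset([m, r])   # start in state 1 (running)
    _, w2 = gain_and_offset([r, m])   # start in state 2 (broken down)
    return g, w1, w2


for policy in ("AB", "BB"):
    print(policy, *offsets(*policy))
```

Under these assumptions both policies give w_1 - w_2 = 320, matching the common relative values, while the individual offsets are larger for (B, B), which is the sense in which the tie is resolved.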

To illustrate the effect of the distribution shape upon the optimal
policy, consider the infinite-horizon problem with discounting for the
following distributions:

I.  All distributions degenerate.

II.  [illegible] and [illegible] exponential, F[illegible] and F[illegible] degenerate.

III.  [illegible] exponential, F[illegible] degenerate.
In each case, the means are maintained at the values previously given. The
resulting normalized optimal discounted returns, \alpha V_1 and \alpha V_2, are shown
in Figure 6 versus \alpha; changes in policy for different regions of \alpha are
indicated by a vertical bar. Notice that any of the four possible policies may
be optimal, depending upon the discount factor and the distribution assumed.
As \alpha \to 0, either AB or BB is selected as the optimal policy, but there
are no near-optimal policies in the cases shown, even though they all tie in
gain rate at the limit.
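For the all-degenerate case, the normalized discounted return of a fixed stationary policy can be evaluated in closed form. The sketch below rests on assumptions that are not spelled out in this section: continuous discounting by exp(-alpha t), running rewards accruing at a constant rate during each transition, and fixed rewards collected at the start of the transition. The function names are hypothetical, and the code evaluates given policies only; it does not search for the optimum.

```python
import math

# (fixed reward $, running rate $/day, degenerate duration in days); assumed data
MAINTENANCE = {"A": (0, 100, 4), "B": (0, 84, 5)}
REPAIR = {"A": (0, -65, 4), "B": (-100, -200, 1)}


def discounted_leg(fixed, rate, days, alpha):
    """Present value of one transition and the discount factor across it."""
    beta = math.exp(-alpha * days)
    # Integral of rate * exp(-alpha * t) over [0, days]; expm1 keeps accuracy.
    reward = fixed + rate * (-math.expm1(-alpha * days)) / alpha
    return reward, beta


def normalized_returns(maint, repair, alpha):
    """Return (alpha * V_1, alpha * V_2) for a stationary policy."""
    m, r = MAINTENANCE[maint], REPAIR[repair]
    r1, b1 = discounted_leg(*m, alpha)
    r2, b2 = discounted_leg(*r, alpha)
    denom = -math.expm1(-alpha * (m[2] + r[2]))   # 1 - b1 * b2
    v1 = (r1 + b1 * r2) / denom                   # V_1 = r1 + b1 * V_2
    v2 = (r2 + b2 * r1) / denom                   # V_2 = r2 + b2 * V_1
    return alpha * v1, alpha * v2


for alpha in (1.0, 0.1, 0.001):
    for policy in ("AA", "AB", "BA", "BB"):
        print(alpha, policy, normalized_returns(*policy, alpha))
```

As alpha becomes small, the normalized returns of each policy approach its gain rate, so (A, B) and (B, B) both approach 20 while the other two policies fall behind.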

XIX.  Summary 

Besides the extensions of the model to different classes of underlying
Markov chains which have already been mentioned, there are other
modifications which can be made:

1.  A general reward structure, R_{ij}(t, \tau_{ij}), 0 \le t \le \tau_{ij}.

2.  Termination  rewards  which  depend  upon  the  time  until 
the  next  transition. 

3.  "Mixed" horizons, in which the process is terminated
at the next transition after time T, or at min (n, T), etc.

These  modifications  do  not  change  the  solution  algorithms  in  any  substantial 
way. 

A more difficult problem is the resolution of ties in infinite-horizon
problems. Since both the gains and the relative values must be used in the
algorithm to find a new policy which improves the gain, it would appear that
resolution of ties would require knowledge about the transient part of the
expected total reward. Another approach would be to use a secondary
criterion, such as minimum variance, to resolve the ties.

In summary, we have considered an extension of previous work in
Markov-decision processes into models which have Markov-renewal
structures. It is felt that these models embrace a wider class of important
operational problems, since in a Markov-renewal process the times between
transitions can follow a random clock which depends on both the previous
and the next state of the system. The policy-space algorithm remains much
the same as in the Markov models; but now a fundamental distinction appears
in infinite programs: are they infinite in time, or in number of transitions?
Clarification of this distinction appears to be a fundamental part of
Markov-renewal programs, and it will be of interest to see how this distinction
will be reflected in application.

Figure 6.  Normalized expected discounted return versus discount factor
for machine maintenance and repair example, showing the
effect of different transition-time distributions.
APPENDIX A


Here are presented some results on Markov-renewal processes due
to Pyke [15] [16] [17] [18] and Barlow [1] which are used in the text.

The basic function is the joint conditional probability distribution:

Q_{ij}(t) = p_{ij} F_{ij}(t) = \Pr\{ i_{k+1} = j ,\ T(i,j) \le t \mid i_k = i \}     (A.1)

which is defined for all i and j, t \ge 0, and k = 0, 1, 2, ... .

Let u_{ij} represent the first passage time to state j, starting at state
i; from the definition of the transition times:

u_{ij} = T(i,j)
    or  T(i,k) + T(k,j)     (k \ne j)
    or  T(i,k) + T(k,l) + T(l,j)     (k, l \ne j)

where each sequence is determined by the underlying p_{ij}. With this definition
the first passage time from i back to itself does not require the system to
enter another state first. If the distribution function of first-passage times
is denoted by:

G_{ij}(t) = \Pr\{ u_{ij} \le t \}     (t \ge 0 ;  i, j = 1, 2, ..., N)     (A.2)

then a simple renewal argument will give the following relationship between
the G_{ij} and the Q_{ij}:

G_{ij}(t) = Q_{ij}(t) + \sum_{k=1, k \ne j}^{N} \int_0^t G_{kj}(t - x) \, dQ_{ik}(x)     (t \ge 0 ;  i, j = 1, 2, ..., N)     (A.3)
G_{ij}(0) = 0 because of the corresponding restriction on the Q_{ij}. The first
and second moments of the first-passage times will be denoted by \mu_{ij} and
\mu_{ij}^{(2)}, respectively; these need not be finite, in general.

A variable of interest in Markov-renewal processes is N_j(t), the
number of times the system enters state j in the interval (0, t]. In particular,
the mean number of entries into state j in the interval is defined as:

M_{ij}(t) = E\{ N_j(t) \mid i_0 = i \}     (t \ge 0 ;  i, j = 1, 2, ..., N)     (A.4)

From the definition of first-passage times:

M_{ij}(t) = G_{ij}(t) + \int_0^t G_{ij}(t - x) \, dM_{jj}(x) .     (t \ge 0 ;  i, j = 1, 2, ..., N)     (A.5)

Finally, there is an interesting relationship between the M_{ij} and the Q_{ij}:

M_{ij}(t) = Q_{ij}(t) + \sum_{k=1}^{N} \int_0^t Q_{ik}(t - x) \, dM_{kj}(x)     (t \ge 0 ;  i, j = 1, 2, ..., N)     (A.6)

In particular, note that for a one-state process, G(t) = Q(t) = F(t), and
(A.5) and (A.6) reduce to the well-known equation from renewal theory:

M(t) = F(t) + \int_0^t F(t - x) \, dM(x) .     (t \ge 0)     (A.7)

It is this intimate relation with both Markov processes and renewal processes
which led Pyke to define these processes as Markov-renewal. Asymptotic
properties of the M_{ij}(t) used in Equation (39) may be found in References
[16] and [18].
Methods for the solution of the first-passage moments are needed in
Equations (40) and (45) of the main text. From (A.3), or from Reference [1]:*

\mu_{ij} = \sum_{k=1, k \ne j}^{N} p_{ik} \mu_{kj} + \sum_{k=1}^{N} p_{ik} \nu_{ik}     (A.8)

for all i, j. Also,

\mu_{ij}^{(2)} = \sum_{k=1, k \ne j}^{N} p_{ik} [ \mu_{kj}^{(2)} + 2 \nu_{ik} \mu_{kj} ] + \sum_{k=1}^{N} p_{ik} \nu_{ik}^{(2)}     (A.9)

for all i, j.

Both moments are finite if the first and second moments of the F_{ij}(t)
are finite, and the underlying chain is ergodic. In this case, the equations
above are always well-defined.

If Equations (A.8) are multiplied by the stationary probabilities, \pi_i,
and summed, there results the interesting relationship for all j:

\mu_{jj} = \frac{1}{\pi_j} \sum_{k=1}^{N} \pi_k \nu_k     (A.10)


* There are several typographic errors in [1] on pp. 53, 54. The equation
before their (A.3) should have the indices of [illegible] reversed; the condition on
the second summation in (A.4) should read k \ne j; and the equation after (A.4)
should have \pi_j as denominator of both terms on the right-hand side.
In a similar manner, from (A.9), for all j:

\mu_{jj}^{(2)} = \frac{1}{\pi_j} [ \sum_{k=1}^{N} \pi_k \nu_k^{(2)} + 2 \sum_{k=1, k \ne j}^{N} \sum_{i=1}^{N} \pi_i p_{ik} \nu_{ik} \mu_{kj} ]     (A.11)

When \nu_i = \nu_i^{(2)} = 1, these formulae reduce to well-known relationships for
Markov chains.
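The first-passage moment equations are linear and can be solved one target state at a time. The sketch below is an illustration only: it takes the recursions in the standard form mu_ij = nu_i + sum over k != j of p_ik mu_kj and its second-moment analogue, where nu_i = sum over k of p_ik nu_ik, and then checks the two mean-recurrence identities numerically. The function names and the use of exact fractions are choices made here, not part of the text.

```python
from fractions import Fraction as F


def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with exact fractions."""
    n = len(b)
    m = [[F(x) for x in row] + [F(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if m[r][col] != 0)
        m[col], m[pivot] = m[pivot], m[col]
        m[col] = [x / m[col][col] for x in m[col]]
        for r in range(n):
            if r != col and m[r][col] != 0:
                m[r] = [x - m[r][col] * y for x, y in zip(m[r], m[col])]
    return [row[n] for row in m]


def stationary(p):
    """Stationary probabilities: pi P = pi, with sum(pi) = 1."""
    n = len(p)
    a = [[p[k][i] - (1 if i == k else 0) for k in range(n)] for i in range(n - 1)]
    a.append([1] * n)
    return solve(a, [0] * (n - 1) + [1])


def first_passage_moments(p, nu, nu2):
    """First and second moments of first-passage times for every pair (i, j)."""
    n = len(p)
    nu_i = [sum(p[i][k] * nu[i][k] for k in range(n)) for i in range(n)]
    nu2_i = [sum(p[i][k] * nu2[i][k] for k in range(n)) for i in range(n)]
    mu = [[None] * n for _ in range(n)]
    mu2 = [[None] * n for _ in range(n)]
    for j in range(n):
        # Matrix of: mu_ij - sum_{k != j} p_ik mu_kj
        a = [[(1 if i == k else 0) - (p[i][k] if k != j else 0)
              for k in range(n)] for i in range(n)]
        first = solve(a, nu_i)
        rhs = [nu2_i[i] + 2 * sum(p[i][k] * nu[i][k] * first[k]
                                  for k in range(n) if k != j) for i in range(n)]
        second = solve(a, rhs)
        for i in range(n):
            mu[i][j], mu2[i][j] = first[i], second[i]
    return nu_i, nu2_i, mu, mu2


# Machine example under policy (B, B) with degenerate times of 5 days and 1 day.
P_BB = [[0, 1], [1, 0]]
NU_BB = [[0, 5], [1, 0]]
NU2_BB = [[0, 25], [1, 0]]
_, _, MU_BB, MU2_BB = first_passage_moments(P_BB, NU_BB, NU2_BB)
print(MU_BB[0][0], MU2_BB[0][0])   # recurrence time of state 1: mean and second moment
```

For the degenerate two-state cycle the recurrence time is a constant, so its second moment is simply the square of its mean; the same functions apply unchanged to chains with random transition times.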


APPENDIX  B 

In an unpublished report by P. Schweitzer of M.I.T. (private
communication; March 13, 1963), an alternative test criterion has been proposed for a
quantized version of the infinite-time, undiscounted case. His criterion is to
select, for each state i, the alternative z(i) which maximizes

\rho_i^z + \sum_{j=1}^{N} p_{ij}^z v_j - g \nu_i^z     (B.1)

which may be contrasted with our test criterion of Figure 3:

\frac{1}{\nu_i^z} [ \rho_i^z + \sum_{j=1}^{N} p_{ij}^z v_j - v_i ] .     (B.2)

Suppose one had a policy (vector) A which led to an improved policy
B, and let E_i \ge 0 be the improvement in Schweitzer's test criterion for state i,
and let \theta_i be the improvement in our test criterion. Then if \Delta g = g^B - g^A
is the improvement in gain rate between policies A and B, it can be shown that:

\Delta g = \frac{ \sum_{j=1}^{N} \pi_j^B E_j }{ \sum_{j=1}^{N} \pi_j^B \nu_j^B }

and

\Delta g = \frac{ \sum_{j=1}^{N} \pi_j^B \nu_j^B \theta_j }{ \sum_{j=1}^{N} \pi_j^B \nu_j^B } .     (B.3)

In both cases the relative values v_j and the gain rate g of the present policy
are used.

Since the underlying Markov chain is ergodic, and all of the \nu_i are assumed
finite, both criteria lead to an improved policy. It is not known if one of
them is computationally more efficient than the other.
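The two test criteria can be compared on the machine example. The sketch below is a hedged illustration: it starts from the stationary policy (A, A), uses the one-transition expected rewards and mean times implied by the example data, writes the two test quantities as rho + v_next - g nu and (rho + v_next - v_i)/nu, and verifies both expressions for the change in gain rate with stationary probabilities (1/2, 1/2). All names are hypothetical.

```python
from fractions import Fraction as F

# (expected reward per transition $, mean time in days); chain alternates 1 <-> 2
ALT = {1: {"A": (400, 4), "B": (420, 5)}, 2: {"A": (-260, 4), "B": (-300, 1)}}
NEXT = {1: 2, 2: 1}


def evaluate(policy):
    """Gain rate g and relative values v (with v[2] = 0) of a stationary policy."""
    (q1, n1), (q2, n2) = ALT[1][policy[1]], ALT[2][policy[2]]
    g = F(q1 + q2, n1 + n2)
    v = {2: F(0)}
    v[1] = q1 - g * n1 + v[2]   # from v_1 + g nu_1 = rho_1 + v_2
    return g, v


def schweitzer_test(i, z, g, v):
    q, n = ALT[i][z]
    return q + v[NEXT[i]] - g * n


def our_test(i, z, g, v):
    q, n = ALT[i][z]
    return (q + v[NEXT[i]] - v[i]) / n


def improve(policy, test):
    """New policy and per-state improvement in the chosen test quantity."""
    g, v = evaluate(policy)
    new = {i: max(ALT[i], key=lambda z: test(i, z, g, v)) for i in ALT}
    gain = {i: test(i, new[i], g, v) - test(i, policy[i], g, v) for i in ALT}
    return new, gain


present = {1: "A", 2: "A"}
new_s, E = improve(present, schweitzer_test)
new_o, theta = improve(present, our_test)
delta_g = evaluate(new_s)[0] - evaluate(present)[0]
nu_new = {i: ALT[i][new_s[i]][1] for i in ALT}
pi = {1: F(1, 2), 2: F(1, 2)}   # stationary probabilities of the alternating chain
mean_time = sum(pi[i] * nu_new[i] for i in ALT)
print(new_s, new_o, delta_g)
```

In this small example both criteria select the same improved policy, (B, B), and the per-state improvements differ only by the factor nu_i of the new policy.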
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